Speech recognition training

ABSTRACT

A method and system performs speech recognition training using Hidden Markov Models. Initially, preprocessed speech signals that include a plurality of observations are stored by the system. Initial Hidden Markov Model (HMM) parameters are then assigned. Summations are then calculated using modified equations derived substantially from the following equations, wherein u ≤ v < w:
     
    P(x_u^w) = P(x_u^v) P(x_{v+1}^w)

and

    Ω_ij(x_u^w) = Ω_ij(x_u^v) P(x_{v+1}^w) + P(x_u^v) Ω_ij(x_{v+1}^w)
     The calculated summations are then used to perform HMM parameter reestimation. It then determines whether the HMM parameters have converged. If they have, the HMM parameters are then stored. However, if the HMM parameters have not converged, the system again calculates summations, performs HMM parameter reestimation using the summations, and determines whether the parameters have converged. This process is repeated iteratively until the HMM parameters have converged.

BACKGROUND OF THE INVENTION

The present invention is directed to speech recognition training. More particularly, the present invention is directed to speech recognition training using Hidden Markov Models.

A popular approach to performing speech recognition is to use Hidden Markov Models (HMMs). An HMM is a probabilistic function of a Markov chain and can be defined as {S, X, Π, A, B}, where S = {s_1, s_2, ..., s_n} are the Markov chain states, X denotes the HMM output (observation) set, Π is a vector of state initial probabilities, A = [a_ij]_{n,n} is a matrix of state transition probabilities (a_ij = Pr{s_j | s_i}), and B(x) = diag{b_j(x)} is a diagonal matrix of the conditional probability densities of the output x ∈ X in state s_j. If X is discrete, B(x) is a matrix of probabilities (b_j(x) = Pr{x | s_j}). Without loss of generality, states are denoted by their indices (s_i = i).
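By way of a non-limiting illustration, the following minimal Python/NumPy sketch (with assumed example values; the class and variable names are illustrative and not taken from any cited reference) shows one way to hold the parameter set {S, X, Π, A, B} of a discrete-observation HMM.

    import numpy as np

    class DiscreteHMM:
        """Minimal container for the HMM parameters {S, X, Pi, A, B} (illustrative only)."""

        def __init__(self, pi, a, b):
            self.pi = np.asarray(pi, dtype=float)  # Pi: initial state probabilities, shape (n,)
            self.a = np.asarray(a, dtype=float)    # A: transition probabilities a_ij, shape (n, n)
            self.b = np.asarray(b, dtype=float)    # b_j(x): output probabilities, shape (n, M)

        def B(self, x):
            """Diagonal matrix B(x) = diag{b_j(x)} for a discrete output symbol x."""
            return np.diag(self.b[:, x])

    # Example: a 2-state model with 3 output symbols (assumed values).
    hmm = DiscreteHMM(pi=[0.6, 0.4],
                      a=[[0.7, 0.3],
                         [0.2, 0.8]],
                      b=[[0.5, 0.4, 0.1],
                         [0.1, 0.3, 0.6]])
    print(hmm.B(2))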

In order for a device to perform speech recognition, that device must first fit HMMs to experimental data, which entails generating model parameters. This process is referred to as "training" the speech recognition device.

There are a number of well-known ways to build a Hidden Markov Model for speech recognition. For example, as set forth in L. Rabiner et al., "Fundamentals of Speech Recognition", Chapter 6, Section 6.15, a simple isolated word recognition model can be created by assigning each word in a vocabulary a separate model, and estimating the model parameters (A, B, π) that optimize the likelihood of the training set observation vectors for that particular word. For each unknown word to be recognized, the system (a) carries out measurements to create an observation sequence X via feature analysis of the speech corresponding to the word; (b) calculates the likelihood for all possible word models; and (c) selects the word whose model likelihood is highest. Examples of other speech recognition systems using Hidden Markov Models can be found in Rabiner et al. and in U.S. Pat. Nos. 4,587,670 to Levinson et al. (reissued as Re. 33,597) and 4,783,804 to Juang et al., which are incorporated by reference herein.

There are various known methods to perform training using HMMs by optimizing a certain criterion (e.g., a likelihood function, an a posteriori probability, an average discrimination measure, etc.). However, these known methods all have drawbacks. For example, known methods that use the Newton-Raphson algorithm or the Conjugate Gradient algorithm, both of which are disclosed in W. H. Press et al., "Numerical Recipes in C", Cambridge University Press (1992), converge fast in a small vicinity of the optimum, but are not robust. Therefore, if parameter values are not very close to the optimum, they might not converge.

Further, a known method that uses the Baum-Welch algorithm in conjunction with the forward-backward algorithm (the "Baum-Welch" method) is disclosed in L. E. Baum et al., "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains", Ann. Math. Statist., 41, pp. 164-171 (1970). Training using this method converges slowly and requires a large amount of memory. Therefore, training using this method must be implemented on a powerful computer with a large amount of memory.

Various approaches are known that speed up the Baum-Welch method. For example, W. Turin, "Fitting Probabilistic Automata via the EM Algorithm", Commun. Statist. - Stochastic Models, 12, No. 3, (1996) pp. 405-424 discloses that the speed of the forward-backward algorithm can be increased if observation sequences have repeated patterns and, in particular, long stretches of repeated observations. S. Sivaprakasam et al., "A Forward-Only Procedure for Estimating Hidden Markov Models", GLOBECOM (1995) discloses that in the case of discrete observations, a forward-only algorithm can be used that is equivalent to the forward-backward algorithm. However, these known approaches require specialized situations (i.e., long stretches of repeated observations and discrete observations).

Based on the foregoing, there is a need for a speech recognition training method and apparatus for generalized situations that is robust and does not require a large amount of memory.

SUMMARY OF THE INVENTION

The present invention is a method and system for performing speech recognition training using Hidden Markov Models that satisfies the above needs and more. In one embodiment, the present invention first stores preprocessed speech signals that include a plurality of observations. Initial Hidden Markov Model (HMM) parameters are then assigned. Summations are then calculated using modified equations derived substantially from the following equations, wherein u ≤ v < w:

    P(x_u^w) = P(x_u^v) P(x_{v+1}^w)

and

    Ω_ij(x_u^w) = Ω_ij(x_u^v) P(x_{v+1}^w) + P(x_u^v) Ω_ij(x_{v+1}^w)

The calculated summations are then used to perform HMM parameter reestimation. The present invention then determines whether the HMM parameters have converged. If they have, the HMM parameters are then stored. However, if the HMM parameters have not converged, the present invention again calculates summations, performs HMM parameter reestimation using the summations, and determines whether the parameters have converged. This process is repeated iteratively until the HMM parameters have converged.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram of a speech algorithm training apparatus in accordance with the present invention.

FIG. 2 is a flow chart of the steps performed by the prior art to calculate summations.

FIG. 3 is a flowchart of the steps performed by one embodiment of the present invention to calculate summations.

FIG. 4 is a flowchart of the steps performed by another embodiment of the present invention to calculate summations.

FIG. 5 is a flowchart of the steps performed by another embodiment of the present invention to calculate summations.

DETAILED DESCRIPTION

The present invention is a method and apparatus for performing speech algorithm training. The present invention implements unique algorithms in order to generate HMM parameters and summations that are used in conjunction with the Baum-Welch algorithm. The use of the algorithms enables the speech recognition training of the present invention to be more robust than the prior art and to require less memory.

FIG. 1 is a flow diagram of a speech algorithm training apparatus in accordance with the present invention. Because the present invention does not require a large amount of memory, it can be implemented on a relatively small computer compared to prior art speech algorithm training methods. In one embodiment, the present invention is implemented on a general purpose computer that includes a processor and a storage device. In another embodiment, the present invention is implemented on a computer with parallel processors.

Storage device 10 in FIG. 1 stores preprocessed speech samples, or training data, in the form of feature vectors. Preprocessing speech samples and converting them to vectors is well known and is disclosed in, for example, L. Rabiner et al., "Fundamentals of Speech Recognition", Prentice Hall, Englewood Cliffs, N.J. (1993).

For example, a sequence of measurements can be made on a speech input signal to define a test pattern. For speech signals, the feature measurements are usually the output of some type of spectral analysis technique, such as a filter bank analyzer, a linear predictive coding ("LPC") analysis, conversion to cepstral coefficients (or delta cepstral coefficients or delta energy, etc.), or a discrete Fourier transform ("DFT") analysis. Such acoustic feature analysis is well known in the art. For example, the two dominant methods of spectral analysis, namely filter-bank spectrum analysis and LPC analysis, are discussed at length in Chapter 3 of Rabiner et al. The output of the acoustic feature analysis is a time sequence of spectral feature vectors, or, as it is often referred to, a speech pattern.

At step 12, HMM initialization is performed in a known manner by assigning initial HMM parameters. One known method for performing HMM initialization uses the segmental K-means algorithm, which is disclosed in L. Rabiner et al., "Fundamentals of Speech Recognition", Prentice Hall, Englewood Cliffs, N.J. (1993). The initial parameters are provided to step 14, described below, when the speech algorithm training apparatus is initialized.

At step 14, one of the modified forward-backward algorithms that will be described in detail below performs calculations based on training data stored in storage unit 10 and old HMM parameters. The old HMM parameters are input from either step 12, when the speech algorithm training apparatus is initialized, or from step 20, described below. The calculations are used in step 16, described below.

At step 16, model reestimation is performed using the Baum-Welch equations to improve the parameters received from step 14. Model reestimation using the Baum-Welch algorithm is an iterative procedure for maximum likelihood estimation of the parameters; the algorithm finds the maximum iteratively. The Baum-Welch algorithm is well known and is disclosed in, for example, L. E. Baum et al., "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains", Ann. Math. Statist., 41, pp. 164-171 (1970), herein incorporated by reference.

When performing model reestimation in step 16, the following Baum-Welch equations are used: ##EQU1## is the estimated mean number of transitions from state i to state j and observing a sequence x_1^T = (x_1, x_2, ..., x_T), ##EQU2## is the probability of being in state j at the moment t and observing x_1^T,

    B_ij(t, x_1^T, θ_p) = α_i(x_1^{t-1}) a_{ij,p} b_j(x_t; Ψ_{j,p}) β_j(x_{t+1}^T)    (1f)

is the probability of transferring from state i to state j at the moment t and observing x_1^T. Equations (1a)-(1f) are similar to the equations for estimating parameters of Markov chains disclosed in P. Billingsley, "Statistical Methods in Markov Chains," Ann. Math. Statist., vol. 32, pp. 12-40 (1961). The only difference is that actually observed transitions are replaced by the estimated mean number of transitions.

Equations (1a) and (1b) are used to fit a hidden Markov chain, while equation (1c) is used to fit state observation probability densities.

Step 14 in FIG. 1 provides the summation calculations that are needed in equations (1b), (1c), (1d) and (1e). The method of performing step 14 in accordance with the present invention requires less computing power and storage space than the prior art method of calculating the summations.

At step 20, model convergence is checked by using one of the well-known criteria disclosed in, for example, W. H. Press et al., "Numerical Recipes in C", Cambridge University Press (1992). One criterion is to determine how much the new model parameters from step 16 differ from the old ones. If the difference is greater than a predetermined value, step 14 is performed again. If the difference is less than the predetermined value, the model parameters are stored in storage unit 18. The HMM parameters can then be used by a speech recognition device in a known way.

Probabilities of various events can be evaluated using the notion of matrix probability. It is convenient to introduce the matrix probability density of an observation x as

    P(x) = A B(x)    (2)

Using this notation, the probability density of x_1^T can be expressed as ##EQU3## where 1 is a column vector of ones. If observations are discrete, this formula represents the sequence probability. The matrix probability density (or probability in the discrete case) of the sequence is defined as ##EQU4## Then equation (3) can be written as

    P(x_1^T) = Π P(x_1^T) 1

Therefore, the probability density of any event in the σ-field generated by the observation sequences can be computed similarly using the matrix probability of the event.
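By way of a non-limiting illustration, the following Python/NumPy sketch (assumed example model values) builds the matrix probabilities P(x) = A B(x) of equation (2), chains them as in equation (4), and recovers the scalar sequence likelihood Π P(x_1^T) 1 of equation (3).

    import numpy as np

    # Illustrative discrete HMM (assumed values).
    pi = np.array([0.6, 0.4])                      # Pi: initial state probabilities
    A = np.array([[0.7, 0.3],
                  [0.2, 0.8]])                     # state transition probabilities
    b = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])                # b_j(x): output probabilities

    def P(x):
        """Matrix probability of one observation, P(x) = A B(x), equation (2)."""
        return A @ np.diag(b[:, x])

    def matrix_probability(seq):
        """Matrix probability of a sequence, P(x_1^T) = P(x_1) P(x_2) ... P(x_T), equation (4)."""
        out = np.eye(len(pi))                      # unit matrix, consistent with P(x_u^v) = I for u > v
        for x in seq:
            out = out @ P(x)
        return out

    seq = [0, 2, 1, 1, 2]
    likelihood = pi @ matrix_probability(seq) @ np.ones(len(pi))   # Pi P(x_1^T) 1, equation (3)
    print(likelihood)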

Let 1 < t_1 < t_2 < ... < t_k < T be a partition of the interval [1, T]; then the matrix probabilities defined by equation (4) can be written as:

    P(x_1^T) = P(x_1^{t_1}) P(x_{t_1+1}^{t_2}) ... P(x_{t_k+1}^T)    (5)

Each matrix probability P(x_u^v) in this equation can be evaluated independently (possibly on parallel processors) using matrix forward or backward algorithms:

    P(x_u^{t+1}) = P(x_u^t) P(x_{t+1}), t = u, u+1, ..., v-1

    P(x_t^v) = P(x_t) P(x_{t+1}^v), t = v-1, v-2, ..., u

or, more generally,

    P(x_u^w) = P(x_u^v) P(x_{v+1}^w), for u ≤ v < w    (6)

It is also convenient to assume that P(x_u^v) = I is a unit matrix for u > v. A fast matrix exponentiation algorithm can be applied if x_u = x_{u+1} = ... = x_v = x.

Using these matrix probabilities we can write

    α(x_1^t) = Π P(x_1^t),  β(x_t^T) = P(x_t^T) 1

and

    α(x_1^v) = α(x_1^u) P(x_{u+1}^v),  β(x_u^T) = P(x_u^{v-1}) β(x_v^T)

The present invention performs the function of step 14 in FIG. 1 by utilizing a matrix form of the Baum-Welch algorithm. The matrix form is derived from the Expectation Maximization ("EM") algorithm, which is disclosed in, for example, W. Turin, "Fitting Probabilistic Automata via the EM Algorithm", Commun. Statist. - Stochastic Models, 12, No. 3, (1996) pp. 405-424, herein incorporated by reference.

The EM algorithm can be modified to be used for multiple observations. For example, suppose that there are several observation sequences {x_k}_1^{T_k}, k = 1, 2, ..., K. In this case the EM algorithm takes the form ##EQU5## Equation (7c) can be solved analytically for discrete and exponential family observation probability distributions.

It follows from equations (7a), (7b) and (7c) that the main difficulty in applying the Baum-Welch algorithm, which is a special case of the EM algorithm, is computing sums of the type ##EQU6## where η_t are some weights. In order to develop efficient algorithms for computing these sums, η_t B_ij(t, x_1^T, θ) can be presented in the following matrix form

    η_t B_ij(t, x_1^T, θ) = η_t a_ij b_j(x_t; Ψ_j) α_i(x_1^{t-1}) β_j(x_{t+1}^T) = α(x_1^{t-1}) W_ij(x_t) β(x_{t+1}^T)

where

    W_ij(x_t) = η_t a_ij b_j(x_t; Ψ_j) e_i^T e_j

e_i = (0, 0, ..., 0, 1, 0, ..., 0) is a unit vector whose i-th coordinate is 1. Thus, efficient algorithms for computing the following sums need to be developed: ##EQU7##

The prior art method for computing these sums is called the forward-backward algorithm, which is depicted in the flowchart set forth in FIG. 2. It consists of two parts: a forward part and a backward part. In the forward part, the forward vectors α(x_1^t) are initialized to α(x_1^0) = Π in step 30 and are evaluated in step 34 according to the equation

    α(x_1^{t+1}) = α(x_1^t) P(x_{t+1}), t = 0, 1, ..., T-1

and saved in memory at step 36. In the backward part, the backward vectors are computed recursively in step 46 as β(x_{T+1}^T) = 1 and

    β(x_t^T) = P(x_t) β(x_{t+1}^T)

then B_ij(t, x_1^T, θ_p) are computed according to equation (1f) and η_t B_ij(t, x_1^T, θ_p) are added in step 48 to accumulators according to equation (1d). In step 52, the accumulated sums are sent to step 16 of FIG. 1, which performs the model parameter reestimation. Thus, the prior art requires storage for saving the forward vectors which is proportional to the observation sequence length T.

Direct application of this equation, accordingly, usually requires an enormous amount of memory if T is large and α(x_1^{t-1}) or β(x_{t+1}^T) are saved in the computer memory. Alternatively, if both α(x_1^{t-1}) and β(x_{t+1}^T) are calculated on the fly, an enormous amount of processing time is required.
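To make the storage requirement concrete, the following Python/NumPy sketch (assumed example values; η_t taken as 1) follows the prior-art accumulation of FIG. 2: all T forward vectors are stored before the backward pass, so memory grows linearly with the sequence length T.

    import numpy as np

    # Illustrative discrete HMM (assumed values).
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.2, 0.8]])
    b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

    def P(x): return A @ np.diag(b[:, x])          # P(x) = A B(x)

    def forward_backward_sums(seq):
        """Prior-art accumulation: stores all T forward vectors, so memory is O(T*n)."""
        n, T = len(pi), len(seq)
        # Forward part: alpha(x_1^0) = Pi, alpha(x_1^{t+1}) = alpha(x_1^t) P(x_{t+1}).
        alphas = [pi.copy()]
        for x in seq:
            alphas.append(alphas[-1] @ P(x))
        # Backward part: beta(x_{T+1}^T) = 1, beta(x_t^T) = P(x_t) beta(x_{t+1}^T).
        beta = np.ones(n)
        S = np.zeros((n, n))                       # accumulators for the sums of B_ij(t, x_1^T, theta)
        for t in range(T - 1, -1, -1):
            x = seq[t]
            # B_ij(t) = alpha_i(x_1^{t-1}) a_ij b_j(x_t) beta_j(x_{t+1}^T), equation (1f), eta_t = 1.
            S += np.outer(alphas[t], beta) * A * b[:, x]
            beta = P(x) @ beta
        return S

    print(forward_backward_sums([0, 2, 1, 1, 2]))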

In contrast, the present invention utilizes recursive algorithms to calculate the sum and, as is clear from the equations below, needs a storage size which is independent of the sequence length. ##EQU8## To calculate this matrix sum recursively, denote ##EQU9## It is easy to see that for any u ≤ v < w

    Ω_ij(x_u^w) = Ω_ij(x_u^v) P(x_{v+1}^w) + P(x_u^v) Ω_ij(x_{v+1}^w)    (12)

Equation (12), together with equation (6), forms the basis for the modified forward-backward algorithms implemented by the present invention and illustrated by the flowchart in FIG. 3. Depending on the order of the matrix evaluation they can be treated as parallel, forward-only, or backward-only algorithms. It follows from equation (1e) that sums of the following form need to be calculated: ##EQU10##

If it is not necessary to calculate S_ijT for some other parameter, recursive algorithms for calculating S_jT directly can be derived to save memory and speed up the computation. These algorithms can be obtained from the corresponding algorithms for S_ijT by summing both sides of the algorithm equations with respect to i. Thus, for example, equation (12) becomes: ##EQU11##

Parallel Algorithms

FIG. 3 is a flowchart of the steps performed by one embodiment of the present invention to perform step 14 of FIG. 1. First, the accumulators (i.e., locations in memory where the sums are accumulated) are initialized in a known manner. Initialization is typically achieved by clearing the memory.

In steps 130-135, the present invention reads training data from storage device 10 shown in FIG. 1. Training data can be read in parallel.

In steps 140-145, P's and Ω's are calculated in parallel for each set of training data read in steps 131-133. Steps 140-145 are performed as follows:

Equation (12) allows calculations to be performed on parallel processors. To be more specific, denote by 1 < t_1 < t_2 < ... < t_k < T a partition of the interval [1, T]; then it follows that:

    P(x_1^{t_{k+1}}) = P(x_1^{t_k}) P(x_{t_k+1}^{t_{k+1}})    (13a)

    Ω_ij(x_1^{t_{k+1}}) = Ω_ij(x_1^{t_k}) P(x_{t_k+1}^{t_{k+1}}) + P(x_1^{t_k}) Ω_ij(x_{t_k+1}^{t_{k+1}})    (13b)

These equations allow computations to be performed on parallel processors. Indeed, the matrices P(x_{t_k+1}^{t_{k+1}}) and Ω_ij(x_{t_k+1}^{t_{k+1}}) can be evaluated independently on parallel processors. Then equations (13) can be applied in the following way at steps 150 and 152 of FIG. 3:

Compute Ω_ij(x_1^{t_1}) using any of the previously described algorithms. Then, using (13b), the following is obtained:

    Ω_ij(x_1^{t_2}) = Ω_ij(x_1^{t_1}) P(x_{t_1+1}^{t_2}) + P(x_1^{t_1}) Ω_ij(x_{t_1+1}^{t_2})

and equation (13a) gives

    P(x_1^{t_2}) = P(x_1^{t_1}) P(x_{t_1+1}^{t_2})

Applying equations (13) again results in

    Ω_ij(x_1^{t_3}) = Ω_ij(x_1^{t_2}) P(x_{t_2+1}^{t_3}) + P(x_1^{t_2}) Ω_ij(x_{t_2+1}^{t_3})

and

    P(x_1^{t_3}) = P(x_1^{t_2}) P(x_{t_2+1}^{t_3})

and so on.

These equations are valid for any partition of the interval [1, T]. However, since the matrices P(x_{t_k+1}^{t_{k+1}}) and Ω_ij(x_{t_k+1}^{t_{k+1}}) are the same for repeated observation patterns, the partition should take advantage of this property.

In the special case in which t_k = t and t_{k+1} = t + 1,

    Ω_ij(x_t^t) = W_ij(x_t)    (14)

and equations (13) take the form

    P(x_1^{t+1}) = P(x_1^t) P(x_{t+1})    (15a)

    Ω_ij(x_1^{t+1}) = Ω_ij(x_1^t) P(x_{t+1}) + P(x_1^t) W_ij(x_{t+1})    (15b)

Note that the evaluation of Ω_ij(x_1^T) is performed in a forward-only fashion. Alternatively, the backward-only algorithm can also be applied, starting with Ω_ij(x_{t_k}^T), P(x_{t_k}^T) and recursively computing

    P(x_{t_k}^T) = P(x_{t_k}^{t_{k+1}-1}) P(x_{t_{k+1}}^T)    (16a)

    Ω_ij(x_{t_k}^T) = P(x_{t_k}^{t_{k+1}-1}) Ω_ij(x_{t_{k+1}}^T) + Ω_ij(x_{t_k}^{t_{k+1}-1}) P(x_{t_{k+1}}^T)    (16b)

However, the actual direction of evaluation is not important. The present invention can compute part of the matrix products by the forward-only algorithm and part of the products by the backward-only algorithm. In the parallel implementation, the products can be evaluated at the moment when all the matrices in the right hand side of equations (13a) and (13b) are available. In this case the evaluation direction is defined by a tree in which the value Ω_ij(x_u^w) in a parent node is evaluated using the values Ω_ij(x_u^v) and Ω_ij(x_{v+1}^w) of its children according to equations (13a) and (13b); Ω_ij(x_1^T) is obtained at the root of the tree.
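By way of a non-limiting illustration, the following Python/NumPy sketch (assumed example values; η_t taken as 1, and a single state pair (i, j) tracked for brevity) evaluates the pair (P, Ω_ij) for each segment of a partition and merges adjacent pairs according to equations (13a) and (13b); in a parallel implementation, each level of the merging tree could be distributed over processors.

    import numpy as np

    # Illustrative discrete HMM (assumed values).
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.2, 0.8]])
    b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    n = len(pi)

    def P(x): return A @ np.diag(b[:, x])          # P(x) = A B(x)

    def W(x, i, j, eta=1.0):
        """W_ij(x): zero except for eta * a_ij * b_j(x) in position (i, j)."""
        out = np.zeros((n, n))
        out[i, j] = eta * A[i, j] * b[j, x]
        return out

    def leaf(x, i, j):
        """Segment of length one: (P(x_t^t), Omega_ij(x_t^t)), equation (14)."""
        return P(x), W(x, i, j)

    def merge(left, right):
        """Combine two adjacent segments with equations (13a) and (13b)."""
        P_l, O_l = left
        P_r, O_r = right
        return P_l @ P_r, O_l @ P_r + P_l @ O_r

    def tree_omega(seq, i, j):
        """Pairwise merging; each level of the tree could be evaluated in parallel."""
        nodes = [leaf(x, i, j) for x in seq]
        while len(nodes) > 1:
            merged = [merge(nodes[k], nodes[k + 1]) for k in range(0, len(nodes) - 1, 2)]
            if len(nodes) % 2:                     # an unpaired segment carries over to the next level
                merged.append(nodes[-1])
            nodes = merged
        return nodes[0]                            # (P(x_1^T), Omega_ij(x_1^T)) at the root

    P_total, Omega_01 = tree_omega([0, 2, 1, 1, 2], i=0, j=1)
    print(pi @ Omega_01 @ np.ones(n))              # S_ijT = Pi Omega_ij(x_1^T) 1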

In the special case in which t_k = t and t_{k+1} = t + 1, equations (16) become

    P(x_t^T) = P(x_t) P(x_{t+1}^T)

    Ω_ij(x_t^T) = P(x_t) Ω_ij(x_{t+1}^T) + W_ij(x_t) P(x_{t+1}^T)

Forward-Only Algorithms

FIG. 4 is a flowchart of the steps performed by another embodiment of the present invention to perform step 14 of FIG. 1. First, the accumulators (i.e., locations in memory where the sums are accumulated) are initialized in a known manner. Initialization is typically achieved by clearing the memory.

In steps 230-235, the present invention reads training data from storage device 10 shown in FIG. 1. Training data can be read sequentially, or in parallel to speed up processing.

In steps 240-245, P's and Ω's are calculated for the training data read in steps 230-235, sequentially or in parallel. In steps 250-254, α's and ω's are calculated using input from steps 240-245 in a forward-only manner.

To speed up calculations and reduce computer memory requirements, the matrix equations can be converted into vector equations by multiplying equations (13) from the left by Π:

    α(x_1^{t_{k+1}}) = α(x_1^{t_k}) P(x_{t_k+1}^{t_{k+1}})    (17a)

    ω_ij(x_1^{t_{k+1}}) = ω_ij(x_1^{t_k}) P(x_{t_k+1}^{t_{k+1}}) + α(x_1^{t_k}) Ω_ij(x_{t_k+1}^{t_{k+1}})    (17b)

where

    ω_ij(x_1^{t_k}) = Π Ω_ij(x_1^{t_k})

Ω_ij(x_{t_k+1}^{t_{k+1}}) can still be evaluated on parallel processors as in FIG. 3, but ω_ij(x_1^{t_{k+1}}) are evaluated sequentially. Thus, equations (17) represent a forward-only algorithm.

If t_k = t and t_{k+1} = t + 1, equations (17) become

    α(x_1^{t+1}) = α(x_1^t) P(x_{t+1})    (18a)

    ω_ij(x_1^{t+1}) = ω_ij(x_1^t) P(x_{t+1}) + α(x_1^t) W_ij(x_{t+1})    (18b)

This specialized version of the forward-only algorithm is disclosed in, for example, N. Tan, "Adaptive Channel/Code Matching," Ph.D. dissertation, University of Southern California (Nov. 1993).
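By way of a non-limiting illustration, the following Python/NumPy sketch (assumed example values) implements the forward-only recursion of equations (18a) and (18b) with η_t = 1: a single pass over the observations keeps only α(x_1^t) and the ω_ij accumulators, so the storage does not grow with T.

    import numpy as np

    # Illustrative discrete HMM (assumed values).
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.2, 0.8]])
    b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    n = len(pi)

    def P(x): return A @ np.diag(b[:, x])          # P(x) = A B(x)

    def forward_only_sums(seq, eta=None):
        """Forward-only evaluation of S_ijT per equations (18a) and (18b)."""
        eta = np.ones(len(seq)) if eta is None else eta
        alpha = pi.copy()                          # alpha(x_1^0) = Pi
        omega = np.zeros((n, n, n))                # omega[i, j] = omega_ij(x_1^t), a row vector
        for t, x in enumerate(seq):
            Pt = P(x)
            omega = omega @ Pt                     # omega_ij(x_1^t) P(x_{t+1})
            # alpha(x_1^t) W_ij(x_{t+1}) has its only nonzero entry,
            # alpha_i * eta_t * a_ij * b_j(x_{t+1}), in coordinate j.
            for i in range(n):
                for j in range(n):
                    omega[i, j, j] += alpha[i] * eta[t] * A[i, j] * b[j, x]
            alpha = alpha @ Pt                     # equation (18a)
        return omega @ np.ones(n)                  # S_ijT = omega_ij(x_1^T) 1

    print(forward_only_sums([0, 2, 1, 1, 2]))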

The sum in equation (9) is found as

    S_ijT = ω_ij(x_1^T) 1

We still need to calculate P(x_1^T) [using forward-only equation (15a)] for reestimating Π according to equation (1a). If we assume that the initial probability vector is fixed, there is no need to use equation (1a) and calculate P(x_1^T).

Since all products of probabilities tend to zero, to increase the calculation accuracy and avoid underflow, it is necessary to scale the equations as disclosed in L. Rabiner et al., "Fundamentals of Speech Recognition", Prentice Hall, Englewood Cliffs, N.J. (1993). Multiplying the right-hand sides of equations (18) by the common scale factor c_t, we obtain

    α(x_1^{t+1}) = c_t α(x_1^t) P(x_{t+1})

    ω_ij(t+1, x) = c_t ω_ij(t, x) P(x_{t+1}) + c_t α(x_1^t) W_ij(x_{t+1})

    P(x_1^{t+1}) = c_t P(x_1^t) P(x_{t+1})

In principle, c_t can be any sequence, since in reestimation equations (1a), (1b), (1c), and (1e) numerators and denominators are multiplied by the same factor ∏_t c_t. However, it is recommended in L. Rabiner et al., "Fundamentals of Speech Recognition", Prentice Hall, Englewood Cliffs, N.J. (1993) to normalize α(x_1^{t+1}). Thus,

    c_t = 1 / α(x_1^{t+1}) 1 = 1 / α(x_1^t) P(x_{t+1}) 1

and we have the following relations between the normalized values (denoted here with a bar) and the non-normalized values

    ᾱ(x_1^t) = α(x_1^t) / α(x_1^t) 1 = α(x_1^t) / Π P(x_1^t) 1

    P̄(x_1^t) = P(x_1^t) / Π P(x_1^t) 1

Advantageously, for the selected scale factors, the observation log-likelihood can be evaluated as ##EQU12## and equation (1a) is also simplified:

    Π_{i,p+1} = Π_{i,p} β_i(x_1^T),  β(x_1^T) = P(x_1^T) 1
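By way of a non-limiting illustration, the following Python/NumPy sketch (assumed example values) applies the scale factor c_t = 1/(α(x_1^{t+1}) 1) at every step; with this standard normalization, the observation log-likelihood is recovered as the negative sum of the logarithms of the scale factors.

    import numpy as np

    # Illustrative discrete HMM (assumed values).
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.2, 0.8]])
    b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

    def P(x): return A @ np.diag(b[:, x])          # P(x) = A B(x)

    def scaled_forward(seq):
        """Scaled forward recursion: alpha is renormalized at every step to avoid underflow."""
        alpha = pi.copy()
        log_likelihood = 0.0
        for x in seq:
            alpha = alpha @ P(x)                   # non-normalized alpha(x_1^{t+1})
            c = 1.0 / alpha.sum()                  # c_t = 1 / (alpha(x_1^{t+1}) 1)
            alpha = c * alpha                      # normalized forward vector
            log_likelihood -= np.log(c)            # log P(x_1^T) = -sum_t log c_t
        return alpha, log_likelihood

    alpha_T, ll = scaled_forward([0, 2, 1, 1, 2])
    print(alpha_T, ll)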

In some applications, we assume that the process is stationary, in which case the initial state probability distribution satisfies the following equations

    Π = Π A,  Π 1 = 1

The stationary distribution can be estimated as a by-product of the Baum-Welch algorithm as disclosed in W. Turin et al., "Modeling Error Sources in Digital Channels", IEEE Jour. Sel. Areas in Commun., 11, No. 3, pp. 340-347 (1993).

Backward-Only Algorithm

FIG. 5 is a flowchart of the steps performed by another embodiment of the present invention to perform step 14 of FIG. 1. First, the accumulators (i.e., locations in memory where the sums are accumulated) are initialized in a known manner. Initialization is typically achieved by clearing the memory.

In steps 330-335, the present invention reads training data from storage device 10 shown in FIG. 1. Training data is read sequentially, or in parallel to speed up processing.

In steps 340-345, P's and Ω's are calculated for the training data read in steps 330-335, sequentially or in parallel. In steps 350-354, β's and ν's are calculated using input from steps 340-345 in a backward-only manner.

Equations (16) form a basis for the backward-only algorithm. Multiplying both sides of these equations from the right by 1, we replace the matrices with column vectors:

    β(x_{t_k}^T) = P(x_{t_k}^{t_{k+1}-1}) β(x_{t_{k+1}}^T)    (19a)

    ν_ij(x_{t_k}^T) = P(x_{t_k}^{t_{k+1}-1}) ν_ij(x_{t_{k+1}}^T) + Ω_ij(x_{t_k}^{t_{k+1}-1}) β(x_{t_{k+1}}^T)    (19b)

where

    ν_ij(x_{t_k}^T) = Ω_ij(x_{t_k}^T) 1

In particular, if t_k = t and t_{k+1} = t + 1, we have

    β(x_t^T) = P(x_t) β(x_{t+1}^T)    (20a)

    ν_ij(x_t^T) = P(x_t) ν_ij(x_{t+1}^T) + W_ij(x_t) β(x_{t+1}^T)    (20b)

The sum in (9) can be written as

    S_ijT = Π ν_ij(x_1^T)

Since equations (20) compute β(x_1^T), which is needed for reestimation of Π, we do not need to compute P(x_1^T) as in the forward-only algorithm. Thus, the backward-only algorithm (20) is simpler than the forward-only algorithm.
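By way of a non-limiting illustration, the following Python/NumPy sketch (assumed example values; η_t taken as 1) implements the backward-only recursion of equations (20a) and (20b): a single backward pass keeps only β(x_t^T) and the column-vector accumulators ν_ij, and S_ijT = Π ν_ij(x_1^T) agrees with the forward-only result.

    import numpy as np

    # Illustrative discrete HMM (assumed values).
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.2, 0.8]])
    b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    n = len(pi)

    def P(x): return A @ np.diag(b[:, x])          # P(x) = A B(x)

    def backward_only_sums(seq, eta=None):
        """Backward-only evaluation of S_ijT per equations (20a) and (20b)."""
        eta = np.ones(len(seq)) if eta is None else eta
        beta = np.ones(n)                          # beta(x_{T+1}^T) = 1
        nu = np.zeros((n, n, n))                   # nu[i, j] = nu_ij(x_t^T), a column vector
        for t in range(len(seq) - 1, -1, -1):
            x = seq[t]
            Pt = P(x)
            nu = np.einsum('ab,ijb->ija', Pt, nu)  # P(x_t) nu_ij(x_{t+1}^T)
            # W_ij(x_t) beta(x_{t+1}^T) has its only nonzero entry,
            # eta_t * a_ij * b_j(x_t) * beta_j(x_{t+1}^T), in coordinate i.
            for i in range(n):
                for j in range(n):
                    nu[i, j, i] += eta[t] * A[i, j] * b[j, x] * beta[j]
            beta = Pt @ beta                       # equation (20a)
        return nu @ pi                             # S_ijT = Pi nu_ij(x_1^T)

    print(backward_only_sums([0, 2, 1, 1, 2]))     # should match the forward-only accumulators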

As previously described, it is important to perform scaling to improve the algorithm precision. The scaled version of the algorithm has the form:

    β(x_t^T) = c_t P(x_t) β(x_{t+1}^T)

    ν_ij(t, x) = c_t P(x_t) ν_ij(t+1, x) + c_t W_ij(x_t) β(x_{t+1}^T)

    c_t = 1 / β(x_t^T) = 1 / Π P(x_t) β(x_{t+1}^T)

This selection of the scale factors allows us to calculate the observation sequence likelihood as a by-product of the Baum-Welch algorithm.

Specific Implementation for Discrete Distributions and Mixtures of Densities

In this section a detailed implementation of the present invention is given for the most widely used types of observation probability distributions: discrete distributions and mixtures of densities. In the case of discrete distributions, equation (1c) has the following solution: ##EQU13## where δ(x, x_t) is the Kronecker delta function [δ(x, x_t) = 1 if x = x_t and δ(x, x_t) = 0 otherwise]. Thus we can use the forward-only or backward-only algorithms with η_t = δ(x, x_t).

The forward-only algorithm according to equations (18) takes the form

    α(x_1^{t+1}) = α(x_1^t) P(x_{t+1})    (21a)

    ω_ij(t+1, x) = ω_ij(t, x) P(x_{t+1}) + α(x_1^t) δ(x, x_{t+1}) W_ij(x_{t+1})    (21b)

    P(x_1^{t+1}) = P(x_1^t) P(x_{t+1})    (21c)

Equation (21b) can be rewritten in the following explicit form ##EQU14##

The initial conditions for the algorithm are ##EQU15## Equations (20) can be rewritten as ##EQU16## The scaled version of the algorithm takes the form

    α(x_1^{t+1}) = c_t α(x_1^t) P(x_{t+1})    (23a)

    ω_ij(t+1, x) = c_t ω_ij(t, x) P(x_{t+1}) + c_t α(x_1^t) δ(x, x_{t+1}) W_ij(x_{t+1})    (23b)

    P(x_1^{t+1}) = c_t P(x_1^t) P(x_{t+1})    (23c)

The Markov chain stationary distribution can be estimated by ##EQU17##
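By way of a non-limiting illustration for the discrete case, the following Python/NumPy sketch (assumed example values) implements the scaled forward-only recursion of equations (23a) and (23b): one accumulator ω_ij(·, x) is kept per output symbol x, and the Kronecker delta means that at each step only the accumulators belonging to the observed symbol receive the new α term.

    import numpy as np

    # Illustrative discrete HMM (assumed values).
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.2, 0.8]])
    b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    n, M = b.shape

    def P(x): return A @ np.diag(b[:, x])          # P(x) = A B(x)

    def scaled_discrete_forward_only(seq):
        """Scaled forward-only recursion, equations (23a) and (23b), with eta_t = delta(x, x_t)."""
        alpha = pi.copy()
        omega = np.zeros((M, n, n, n))             # omega[x, i, j] = omega_ij(t, x), a row vector
        for x_t in seq:
            Pt = P(x_t)
            c = 1.0 / (alpha @ Pt).sum()           # scale factor c_t
            omega = c * (omega @ Pt)               # c_t * omega_ij(t, x) P(x_{t+1})
            # delta(x, x_{t+1}) vanishes except for the observed symbol, so only the
            # accumulators for x = x_t receive the alpha term.
            for i in range(n):
                for j in range(n):
                    omega[x_t, i, j, j] += c * alpha[i] * A[i, j] * b[j, x_t]
            alpha = c * (alpha @ Pt)               # equation (23a)
        return omega @ np.ones(n)                  # per-symbol accumulated sums

    print(scaled_discrete_forward_only([0, 2, 1, 1, 2]))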

As pointed out before, the backward-only algorithm is simpler than the forward-only algorithm in the general case. According to equations (20), the backward-only algorithm can be described as follows:

Initialize

    β(x_{T+1}^T) = 1,  γ_ij(T+1, x) = 0    (24a)

For t = T, T-1, ..., 1 compute

    β(x_t^T) = c_t P(x_t) β(x_{t+1}^T)    (24b)

    γ_ij(t, x) = c_t P(x_t) γ_ij(t+1, x) + c_t δ(x, x_t) W_ij(x_t) β(x_{t+1}^T)    (24c)

    c_t = 1 / β(x_t^T) = 1 / Π_p P(x_t) β(x_{t+1}^T)

Reestimate the parameters: ##EQU18## Repeat the iterations over p till convergence.

In the case of mixture densities, ##EQU19## where x is the observation vector and c_jk are the mixture coefficients. In this case the solution of equation (1c) has the form ##EQU20## Thus, we need to calculate γ_t(j, k) based on equation (25b). After substitutions, we see that ##EQU21## Thus, we need to calculate the sum in equation (8) with η_t = p_jk(x_t; Ψ_jk) / b_j(x_t; Ψ_j). The forward-only algorithm (18) takes the form

    α(x_1^{t+1}) = α(x_1^t) P(x_{t+1})    (27a)

    ω_ijk(t+1) = ω_ijk(t) P(x_{t+1}) + α(x_1^t) W_ijk(x_{t+1})    (27b)

    P(x_1^{t+1}) = P(x_1^t) P(x_{t+1})    (27c)

where

    W_ijk(x_t) = a_ij p_jk(x_t; Ψ_jk) e_i^T e_j

is a matrix whose (i,j)-th element is a_ij p_jk(x_t; Ψ_jk).

The model parameters are reestimated according to equations (1) and (25), which can be written as ##EQU22## The backward-only algorithm has a similar form. Also, the equations should be scaled to avoid numerical underflow, as in the case of the discrete distributions considered previously.

For Gaussian mixtures, we need to calculate ##EQU23## which can be written in the coordinate form: ##EQU24## where x_t^(i) is the i-th coordinate of x_t. These sums can also be evaluated using the forward-only or backward-only algorithm with η_t = x_t^(i) p_jk(x_t; Ψ_jk) / b_j(x_t; Ψ_j) for the first sum and η_t = x_t^(i) x_t^(l) p_jk(x_t; Ψ_jk) / b_j(x_t; Ψ_j) for the second sum.
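By way of a non-limiting illustration, the following Python sketch (using NumPy and SciPy, with assumed mixture parameters for a single state j) computes the weights η_t = p_jk(x_t; Ψ_jk)/b_j(x_t; Ψ_j) used above, together with the coordinate-weighted versions needed for the first- and second-moment sums.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Illustrative Gaussian mixture for one state j with two components (assumed values).
    c_jk = np.array([0.4, 0.6])                              # mixture coefficients
    means = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]    # component means
    covs = [np.eye(2), 0.5 * np.eye(2)]                      # component covariances

    def component_densities(x):
        """p_jk(x; Psi_jk) for each component k of state j."""
        return np.array([multivariate_normal.pdf(x, mean=m, cov=c) for m, c in zip(means, covs)])

    def mixture_weights(x):
        """eta_t per component: p_jk(x; Psi_jk) / b_j(x; Psi_j), with b_j(x) = sum_k c_jk p_jk(x)."""
        p = component_densities(x)
        b_j = float(c_jk @ p)
        return p / b_j

    x_t = np.array([0.3, -0.2])
    eta = mixture_weights(x_t)
    eta_first = x_t[0] * eta                       # weights for the first-moment sum (coordinate i = 0)
    eta_second = x_t[0] * x_t[1] * eta             # weights for the second-moment sum (i = 0, l = 1)
    print(eta, eta_first, eta_second)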

The forward-only algorithm is given by the following equations

    ξ_ijk(t+1) = ξ_ijk(t) P(x_{t+1}) + α(x_1^t) W^ξ_ijk(x_{t+1})    (27d)

    ξ_iljk(t+1) = ξ_iljk(t) P(x_{t+1}) + α(x_1^t) W^ξ_iljk(x_{t+1})    (27e)

where

    W^ξ_ijk(x_t) = x_t^(i) p_jk(x_t; Ψ_jk) A_j e_j

    W^ξ_iljk(x_t) = x_t^(i) x_t^(l) p_jk(x_t; Ψ_jk) A_j e_j

A_j is the j-th column of matrix A. The reestimation equations can be written in the coordinate form as ##EQU25## All these equations need to be scaled as previously described. Equations for the backward-only and parallel algorithms can be derived similarly.

Several embodiments of the present invention are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and are within the purview of the appended claims without departing from the spirit and intended scope of the invention. For example, although the present invention is described in conjunction with a speech recognition training system, it also can be used in any other type of training system that uses HMMs, such as a handwriting recognition system. It can also be applied to modeling error sources in communication channels, and in signal quantization and compression (e.g., of speech or image signals, etc.).

What is claimed is:
 1. In a method of performing speech recognition training using Hidden Markov Models comprising the steps of storing preprocessed speech signals that include a plurality of observations, assigning initial model parameters, performing model reestimation using the Baum-Welch algorithm, and storing the model parameters if they have converged, wherein the improvement comprises: partitioning the stored speech signals into at least a first and second set of training data; calculating a first summation using the first set of training data; calculating a second summation using the second set of training data; calculating a final summation using the first and second summations; and using said final summation in the model reestimation step, wherein the first summation and the second summation are calculated on parallel processors.
 2. The method of claim 1 wherein the final summation is calculated recursively in parallel.
 3. The method of claim 1 wherein the final summation is calculated recursively in a forward-only direction.
 4. The method of claim 3 wherein the final summation is calculated recursively in a backward-only direction.
 5. In a system for performing speech recognition training using Hidden Markov Models comprising means for storing preprocessed speech signals that include a plurality of observations, means for assigning initial model parameters, means for performing model reestimation using the Baum-Welch algorithm, and means for storing the model parameters if they have converged, wherein the improvement comprises: means for partitioning the stored speech signals into at least a first and second set of training data; means for calculating a first summation using the first set of training data, the means further comprising a first processor; means for calculating a second summation using the second set of training data, the means further comprising a second processor; means for calculating a final summation using the first and second summations; and means for providing the final summation to the means for performing model reestimation.
 6. The system of claim 5 wherein the means for calculating a final summation uses a parallel recursion.
 7. The system of claim 5 wherein the means for calculating a final summation uses a forward-only recursion.
 8. The system of claim 5 wherein the means for calculating a final summation uses a backward-only recursion.